Bridging Learning and Planning: The Internal Model
AI029 Lesson 8
00:00

This lecture establishes the conceptual bridge between direct reinforcement learning and planning by introducing the Internal Model. We define the model as any mechanism that mimics the environment's behavior, allowing the agent to predict future states and rewards, a process often called system identification.

[Diagram: real experience drives both model learning (experience → model) and direct RL (experience → value/policy); the model generates simulated experience for planning (model → value/policy).]
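A minimal sketch of such an internal model, assuming a deterministic tabular environment (the class name and methods below are illustrative, not from the lecture):

```python
import random

class TabularModel:
    """Internal model for a deterministic, tabular environment:
    it mimics the environment by replaying observed transitions."""

    def __init__(self, seed=0):
        self.transitions = {}  # (state, action) -> (reward, next_state)
        self.rng = random.Random(seed)

    def update(self, state, action, reward, next_state):
        # Model learning: record what the real environment did.
        self.transitions[(state, action)] = (reward, next_state)

    def sample(self):
        # Simulated experience: pick a previously observed state-action
        # pair and predict the reward and next state that followed it.
        state, action = self.rng.choice(list(self.transitions))
        reward, next_state = self.transitions[(state, action)]
        return state, action, reward, next_state
```

Planning then amounts to repeatedly calling `sample()` and feeding the result to the same backup rule used on real experience.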

The Frozen Lake Analogy

Imagine a self-driving car learning to navigate a frozen lake. Direct RL occurs when the car actually drives, slips on ice, and receives a negative reward, immediately updating its value function. Planning occurs while the car is parked; it uses its Internal Model (a mental map of where the ice was) to simulate thousands of hypothetical turns, updating its policy without ever moving a tire or risking a collision.

Core Insights

  • Indirect RL: Also known as planning, this uses real experience to improve a model; the model then generates simulated experience on which the agent performs the same value updates as direct RL.
  • The Internal Model as a Simulator: In tabular methods, the model typically just records each observed transition and reward; planning then "samples" from this stored history as if it were fresh experience.
  • Algorithmic Unity: Learning and planning are fundamentally identical in their mathematical execution. They both use reinforcement learning backup algorithms (like Q-learning or Sarsa); the only difference is the source of the experience (real vs. simulated).
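The three insights above can be combined into a tabular Dyna-Q-style loop. This is a hedged sketch, not the lecture's code: the toy `step` environment (a four-state "frozen corridor" echoing the lake analogy), the state/action counts, and the hyperparameters are all assumptions. Note that `backup` is the single Q-learning update applied to both real and simulated experience.

```python
import random

def step(s, a):
    """Hypothetical deterministic 'frozen corridor': states 0..3,
    action 1 slides right, action 0 stays; reaching state 3 pays +1."""
    s_next = min(s + 1, 3) if a == 1 else s
    reward = 1.0 if s_next == 3 else 0.0
    return reward, s_next, s_next == 3  # reward, next state, done

def dyna_q(episodes=30, planning_steps=20, alpha=0.5, gamma=0.9,
           epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(4)]  # Q[state][action]
    model = {}                          # (s, a) -> (reward, s_next, done)

    def backup(s, a, r, s2, done):
        # Identical Q-learning backup for real AND simulated experience.
        target = r if done else r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda b: Q[s][b])
            r, s2, done = step(s, a)
            backup(s, a, r, s2, done)       # direct RL (real experience)
            model[(s, a)] = (r, s2, done)   # model learning
            for _ in range(planning_steps): # planning (simulated experience)
                (ps, pa), (pr, ps2, pdone) = rng.choice(list(model.items()))
                backup(ps, pa, pr, ps2, pdone)
            s = s2
    return Q
```

The planning inner loop performs many extra backups "while the car is parked", so the value of the goal propagates back through the corridor far faster than real steps alone would allow.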